Abstract

We theoretically analyze and compare the following five popular multiclass classification methods: 
One vs. All, All Pairs, Tree-based classifiers, Error Correcting Output Codes (ECOC) with randomly 
generated code matrices, and Multiclass SVM. In the first four methods, the classification is based on a 
reduction to binary classification. We consider the case where the binary classifier comes from a class of 
VC dimension $d$, and in particular from the class of halfspaces over $\mathbb{R}^d$. We analyze both the estimation
error and the approximation error of these methods. Our analysis reveals interesting conclusions of 
practical relevance, regarding the success of the different approaches under various conditions. Our proof 
technique employs tools from VC theory to analyze the approximation error of hypothesis classes. This 
is in sharp contrast to most, if not all, previous uses of VC theory, which only deal with estimation error. 

1 Introduction 

In this work we consider multiclass prediction: The problem of classifying objects into one of several 
possible target classes. Applications include, for example, categorizing documents according to topic, and 
determining which object appears in a given image. We assume that objects (a.k.a. instances) are vectors in 
$\mathcal{X} = \mathbb{R}^d$ and the class labels come from the set $\mathcal{Y} = [k] = \{1, \ldots, k\}$. Following the standard PAC model,
the learner receives a training set of $m$ examples, drawn i.i.d. from some unknown distribution, and should
output a classifier which maps $\mathcal{X}$ to $\mathcal{Y}$.

The centrality of the multiclass learning problem has spurred the development of various approaches for 
tackling the task. Perhaps the most straightforward approach is a reduction from multiclass classification to 
binary classification. For example, the One-vs-All (OvA) method is based on a reduction of the multiclass 
problem into k binary problems, each of which discriminates between one class and all the rest of the classes
(e.g. Rumelhart et al. [1986]). A different reduction is the All-Pairs (AP) approach in which all pairs of 
classes are compared to each other [Hastie and Tibshirani, 1998]. These two approaches have been unified 
under the framework of Error Correcting Output Codes (ECOC) [Dietterich and Bakiri, 1995, Allwein et al.,
2000]. A tree-based classifier (TC) is another reduction in which the prediction is obtained by traversing a 
binary tree, where at each node of the tree a binary classifier is used to decide on the rest of the path (see 
for example Beygelzimer et al. [2007]). 

All of the above methods are based on reductions to binary classification. We pay special attention to the 
case where the underlying binary classifiers are linear separators (halfspaces). Formally, each $w \in \mathbb{R}^{d+1}$
defines the linear separator $h_w(x) = \mathrm{sign}(\langle w, \bar{x} \rangle)$, where $\bar{x} = (x, 1) \in \mathbb{R}^{d+1}$ is the concatenation of
the vector $x$ and the scalar 1. While halfspaces are our primary focus, many of our results hold for any
underlying binary hypothesis class of VC dimension $d + 1$.
Other, more direct approaches to multiclass classification over $\mathbb{R}^d$ have also been proposed (e.g. Vapnik
[1998], Weston and Watkins [1999], Crammer and Singer [2001]). In this paper we analyze the Multiclass
SVM (MSVM) formulation of Crammer and Singer [2001], in which each hypothesis is of the form
$h_W(x) = \arg\max_{i \in [k]} (W\bar{x})_i$, where $W$ is a $k \times (d+1)$ matrix, $\bar{x} = (x, 1)$ as above, and $(W\bar{x})_i$ is the $i$'th
element of the vector $W\bar{x} \in \mathbb{R}^k$.
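As a minimal illustration of the two kinds of predictors defined above, the following Python sketch evaluates a halfspace $h_w$ and an MSVM hypothesis $h_W$ on a single instance. The function names and the example matrix are illustrative assumptions and do not come from the paper; ties in the argmax are broken toward the smallest label and sign(0) is taken to be 1, matching the conventions stated in the appendix.

```python
def augment(x):
    """Return x_bar = (x, 1): the instance with a constant 1 appended."""
    return list(x) + [1.0]


def halfspace(w, x):
    """h_w(x) = sign(<w, x_bar>), with sign(0) = 1."""
    score = sum(wi * xi for wi, xi in zip(w, augment(x)))
    return 1 if score >= 0 else -1


def msvm_predict(W, x):
    """h_W(x) = argmax_i (W x_bar)_i, returned as a label in {1, ..., k}."""
    x_bar = augment(x)
    scores = [sum(wi * xi for wi, xi in zip(row, x_bar)) for row in W]
    # list.index returns the first maximizer, i.e. the smallest label.
    return scores.index(max(scores)) + 1


# Hypothetical weights for d = 2 and k = 3 (one row per class, last entry is the bias).
W_example = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0],
             [-1.0, -1.0, 0.5]]
```

For instance, `msvm_predict(W_example, [2.0, 1.0])` scores the three classes as 2, 1 and -2.5 and therefore returns label 1.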

We theoretically analyze the prediction performance of the aforementioned methods, namely, OvA, AP, 
ECOC, TC, and MSVM. The error of a multiclass predictor $h : \mathbb{R}^d \to [k]$ is defined to be the probability
that $h(x) \neq y$, where $(x, y)$ is sampled from the underlying distribution $\mathcal{D}$ over $\mathbb{R}^d \times [k]$, namely,
$\mathrm{Err}(h) = \mathbb{P}_{(x,y) \sim \mathcal{D}}[h(x) \neq y]$. Our main goal is to understand which method is preferable in terms of the error it will
achieve, based on easy-to-verify properties of the problem at hand. 

Our analysis pertains to the type of classifiers each method can potentially find, and does not depend on 
the specific training algorithm. More precisely, each method corresponds to a hypothesis class, H, which 
contains the multiclass predictors that may be returned by the method. For example, the hypothesis class of 
MSVM is $\mathcal{H} = \{x \mapsto \arg\max_{i \in [k]} (W\bar{x})_i : W \in \mathbb{R}^{k \times (d+1)}\}$.

A learning algorithm, $A$, receives a training set, $S = \{(x_i, y_i)\}_{i=1}^m$, sampled i.i.d. according to $\mathcal{D}$, and
returns a multiclass predictor which we denote by $A(S) \in \mathcal{H}$. A learning algorithm is called an Empirical
Risk Minimizer (ERM) if it returns a hypothesis in $\mathcal{H}$ that minimizes the empirical error on the sample. We
denote by $h^*$ a hypothesis in $\mathcal{H}$ with minimal error,$^1$ that is, $h^* \in \arg\min_{h \in \mathcal{H}} \mathrm{Err}(h)$.

When analyzing the error of A(S), it is convenient to decompose this error as a sum of approximation 
error and estimation error: 

$$\mathrm{Err}(A(S)) = \underbrace{\mathrm{Err}(h^*)}_{\text{approximation}} + \underbrace{\mathrm{Err}(A(S)) - \mathrm{Err}(h^*)}_{\text{estimation}}. \qquad (1)$$

• The approximation error is the minimum error achievable by a predictor in the hypothesis class, H. 
The approximation error does not depend on the sample size, and is determined solely by the allowed 
hypothesis class. 

• The estimation error of an algorithm is the difference between the approximation error, and the error 
of the classifier the algorithm chose based on the sample. This error exists both for statistical reasons, 
since the sample may not be large enough to determine the best hypothesis, and for algorithmic 
reasons, since the learning algorithm may not output the best possible hypothesis given the sample. 
For the ERM algorithm, the estimation error can be bounded from above by order of $\sqrt{C(\mathcal{H})/m}$,
where $C(\mathcal{H})$ is a complexity measure of $\mathcal{H}$ (analogous to the VC dimension) and $m$ is the sample
size. A similar term also bounds the estimation error from below for any algorithm. Thus C(H) is 
an estimate of the best achievable estimation error for the class. 

When studying the estimation error of different methods, we follow the standard distribution-free analysis.
Namely, we will compare the algorithms based on the worst-case estimation error, where worst-case
is over all possible distributions $\mathcal{D}$. Such an analysis can lead us to the following type of conclusion: If two
hypothesis classes have roughly the same complexity, $C(\mathcal{H}_1) \approx C(\mathcal{H}_2)$, and the number of available training
examples is significantly larger than this value of complexity, then for both hypothesis classes we are
going to have a small estimation error. Hence, in this case the difference in prediction performance between 
the two methods will be dominated by the approximation error and by the success of the learning algorithm 
in approaching the best possible estimation error. In our discussion below we disregard possible differences 
in optimality which stem from algorithmic aspects and implementation details. A rigorous comparison of 
training heuristics would certainly be of interest and is left to future work. 

For the approximation error we will provide even stronger results, by comparing the approximation 
error of classes for any distribution. We rely on the following definition. 

1 For simplicity, we assume that the minimum is attainable. 
Definition 1.1. Given two hypothesis classes, $\mathcal{H}, \mathcal{H}'$, we say that $\mathcal{H}$ essentially contains $\mathcal{H}'$ if for any
distribution, the approximation error of $\mathcal{H}$ is at most the approximation error of $\mathcal{H}'$. $\mathcal{H}$ strictly contains $\mathcal{H}'$
if, in addition, there is a distribution for which the approximation error of $\mathcal{H}$ is strictly smaller than that of
$\mathcal{H}'$.

Our main findings are as follows (see a full comparison in Table 1). The formal statements are given in 
Section 3. 

• The estimation errors of OvA, MSVM, and TC are all roughly the same, in the sense that
$C(\mathcal{H}) = \tilde{\Theta}(dk)$ for all of the corresponding hypothesis classes. The complexity of AP is $\tilde{\Theta}(dk^2)$. The
complexity of ECOC with a code of length $l$ and code-distance $\delta$ is at most $\tilde{O}(dl)$ and at least $d\delta/2$.
It follows that for randomly generated codes, $C(\mathcal{H}) = \tilde{\Theta}(dl)$. Note that this analysis shows that
a larger code-distance yields a larger estimation error and might therefore hurt performance. This 
contrasts with previous "reduction-based" analyses of ECOC, which concluded that a larger code 
distance improves performance. 

• We prove that the hypothesis class of MSVM essentially contains the hypothesis classes of both OvA 
and TC. Moreover, these inclusions are strict. Since the estimation errors of these three methods 
are roughly the same, it follows that the MSVM method dominates both OvA and TC in terms of 
achievable prediction performance. 

• In the TC method, one needs to associate each leaf of the tree with a label. If no prior knowledge on
how to break the symmetry is available, it is suggested in Beygelzimer et al. [2007] to break symmetry
by choosing a random permutation of the labels. We show that whenever $d \ll k$, for any distribution
$\mathcal{D}$, with high probability over the choice of a random permutation, the approximation error of the
resulting tree would be close to 1/2. It follows that a random choice of a permutation is likely to 
yield a poor predictor. 

• We show that if $d \ll k$, for any distribution $\mathcal{D}$, the approximation error of ECOC with a randomly
generated code matrix is likely to be close to 1/2. 

• We show that the hypothesis class of AP essentially contains the hypothesis class of MSVM (hence 
also that of OvA and TC), and that there can be a substantial gap in the containment. Therefore, as 
expected, the relative performance of AP and MSVM depends on the well-known trade-off between 
estimation error and approximation error. 
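To make the trade-off in the list above concrete, the following Python sketch tabulates the leading-order complexity terms quoted there ($dk$ for OvA, MSVM and TC, $dk^2$ for AP, $dl$ for ECOC) together with the corresponding rate $\sqrt{C(\mathcal{H})/m}$. Constants and logarithmic factors are ignored, so the numbers are only indicative; the function names and the problem sizes are illustrative assumptions.

```python
import math


def complexity(method, d, k, l=None):
    """Leading-order complexity term C(H), ignoring constants and log factors."""
    if method in ("OvA", "MSVM", "TC"):
        return d * k
    if method == "AP":
        return d * k ** 2
    if method == "ECOC":
        if l is None:
            raise ValueError("ECOC needs a code length l")
        return d * l
    raise ValueError(f"unknown method: {method}")


def estimation_rate(method, d, k, m, l=None):
    """The rate sqrt(C(H) / m) that governs the ERM estimation error up to constants."""
    return math.sqrt(complexity(method, d, k, l) / m)


# Hypothetical problem sizes, chosen only for illustration.
d, k, m = 20, 50, 100_000
for name in ("OvA", "MSVM", "TC", "AP"):
    print(name, round(estimation_rate(name, d, k, m), 3))
```

With these sizes the AP rate is larger than the MSVM rate by a factor of $\sqrt{k}$, which is the price AP pays for its smaller approximation error.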
Table 1: Summary of comparison



The above findings suggest that in terms of performance, it may be wiser to choose MSVM over OvA 
and TC, and especially so when $d \ll k$. We note, however, that in some situations (e.g. $d = k$) the
prediction success of these methods can be similar, while TC has the advantage of having a testing run-time
of $d \log(k)$, compared to the testing run-time of $dk$ for OvA and MSVM. In addition, TC and ECOC may
be a good choice when there is additional prior knowledge on the distribution or on how to break symmetry 
between the different labels. 
1.1 Related work



Allwein et al. [2000] analyzed the multiclass error of ECOC as a function of the binary error. Such a
"reduction-based" analysis becomes problematic if the underlying binary
problems are very hard. Indeed, our analysis reveals that the underlying binary problems would be too
hard if $d \ll k$ and the code is randomly generated. The experiments in Allwein et al. [2000] show that
when using kernel-based SVM or AdaBoost as the underlying classifier, OvA is inferior to random ECOC. 
However, in their experiments, the number of classes is small relative to the dimension of the feature space, 
especially if working with kernels or with combinations of weak learners. 

Crammer and Singer [2001] presented experiments demonstrating that MSVM outperforms OvA on 
several data sets. Rifkin and Klautau [2004] criticized the experiments of Crammer and Singer [2001] and
Allwein et al. [2000], and presented another set of experiments demonstrating that all methods perform
roughly the same when the underlying binary classifier is very strong (SVM with a Gaussian kernel). As
our analysis shows, it is not surprising that with enough data and powerful binary classifiers, all methods
should perform well. However, in many practical applications, we will prefer not to employ kernels (either
because of a shortage of examples, which might lead to a large estimation error, or due to computational
constraints), and in such cases we expect to see a large difference between the methods.

Beygelzimer et al. [2007] analyzed the regret of a specific training method for trees, called Filter Tree, 
as a function of the regret of the binary classifier. The regret is defined to be the difference between the
error of the learned classifier and that of the Bayes-optimal classifier for the problem. Here again we show that
the regret values of the underlying binary classifiers are likely to be very large whenever $d \ll k$ and the leaves
of the tree are associated with labels in a random way. Thus in this case the regret analysis is problematic. Several
authors presented ways to learn better splits, which corresponds to learning the association of leaves to
labels (see for example Bengio et al. [2011] and the references therein). Some of our negative results do
not hold for such methods, as these do not randomly attach labels to tree leaves. 

Daniely et al. [2011] analyzed the properties of multiclass learning with various ERM learners, and have
also provided some bounds on the estimation error of multiclass SVM and of trees. In this paper we
improve these bounds, derive new bounds for other classes, and also analyze the approximation error of the
classes. To the best of our knowledge, this is the first case of using VC theory to analyze the approximation 
error of hypothesis classes. 

2 Definitions and Preliminaries 

We first formally define the hypothesis classes that we analyze in this paper. 

Multiclass SVM (MSVM): For $W \in \mathbb{R}^{k \times (d+1)}$ define $h_W : \mathbb{R}^d \to [k]$ by $h_W(x) = \arg\max_{i \in [k]} (W\bar{x})_i$
and let $\mathcal{L} = \{h_W : W \in \mathbb{R}^{k \times (d+1)}\}$. Though NP-hard in general, solving the ERM problem with respect
to $\mathcal{L}$ can be done efficiently in the realizable case (namely, whenever there exists a hypothesis with zero empirical
error on the sample).

Tree-based classifiers (TC): A tree-based multiclass classifier is a full binary tree whose leaves are associated
with class labels and whose internal nodes are associated with binary classifiers. To classify an
instance, we start with the root node and apply the binary classifier associated with it. If the prediction is 1
we traverse to the right child. Otherwise, we traverse to the left child. This process continues until we reach
a leaf, and then we output the label associated with the leaf. Formally, a tree for $k$ classes is a full binary
tree $T$ together with a bijection $\lambda : \mathrm{leaf}(T) \to [k]$, which associates a label to each of the leaves. We usually
identify $T$ with the pair $(T, \lambda)$. The set of internal nodes of $T$ is denoted by $N(T)$. Let $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$ be a
binary hypothesis class. Given a mapping $C : N(T) \to \mathcal{H}$, define a multiclass predictor, $h_C : \mathcal{X} \to [k]$, by
setting $h_C(x) = \lambda(v)$ where $v$ is the last node of the root-to-leaf path $v_1, \ldots, v_m = v$ such that $v_{i+1}$ is the
left (resp. right) child of $v_i$ if $C(v_i)(x) = -1$ (resp. $C(v_i)(x) = 1$). Let $\mathcal{H}_T = \{h_C \mid C : N(T) \to \mathcal{H}\}$.
Also, let $\mathcal{H}_{\mathrm{trees}} = \bigcup_{T \text{ is a tree for } k \text{ classes}} \mathcal{H}_T$. If $\mathcal{H}$ is the class of linear separators over $\mathbb{R}^d$, then for any tree
$T$ the ERM problem with respect to $\mathcal{H}_T$ can be solved efficiently in the realizable case. However, the ERM
problem is NP-hard in the non-realizable case.
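The prediction rule of a tree-based classifier can be sketched in a few lines of Python. This is only an illustration of the root-to-leaf traversal described above; the class names, the axis-aligned `threshold` classifiers and the example tree are hypothetical choices, not taken from the paper.

```python
class Leaf:
    def __init__(self, label):
        self.label = label  # lambda(v): the class label attached to this leaf


class Node:
    def __init__(self, classifier, left, right):
        self.classifier = classifier  # C(v): maps an instance to -1 or +1
        self.left = left
        self.right = right


def tree_predict(root, x):
    """Follow the root-to-leaf path: go left on -1, right on +1."""
    v = root
    while isinstance(v, Node):
        v = v.right if v.classifier(x) == 1 else v.left
    return v.label


def threshold(coord, t):
    """Hypothetical binary classifier: +1 if x[coord] >= t, else -1."""
    return lambda x: 1 if x[coord] >= t else -1


# A full binary tree for k = 3 classes over R^2 (labels and thresholds are illustrative).
example_tree = Node(threshold(0, 0.0),
                    Leaf(1),
                    Node(threshold(1, 0.0), Leaf(2), Leaf(3)))
```

Changing which label is attached to which `Leaf` changes the binary problems the internal nodes have to solve, which is why the choice of $\lambda$ matters for the approximation error.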

Error Correcting Output Codes (ECOC): An ECOC is a code $M \in \mathbb{R}^{k \times l}$ along with a bijection
$\lambda : [k] \to [k]$. We sometimes identify $\lambda$ with the identity function and $M$ with $(M, \lambda)$.$^2$ Given a code $M$,
and the result of $l$ binary classifiers represented by a vector $u \in \{-1, 1\}^l$, the code selects a label via the map
$M : \{-1, 1\}^l \to [k]$, defined by

$$M(u) = \lambda\Bigl(\arg\max_{i \in [k]} \sum_{j=1}^{l} M_{ij} u_j\Bigr).$$

Given binary classifiers $h_1, \ldots, h_l$, one for each column in the code matrix, the code assigns to the instance
$x \in \mathcal{X}$ the label $M(h_1(x), \ldots, h_l(x))$.
Let $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$ be a binary hypothesis class. Denote by $\mathcal{H}_M \subseteq [k]^{\mathcal{X}}$ the hypothesis class
$\mathcal{H}_M = \{h : \mathcal{X} \to [k] \mid \exists (h_1, \ldots, h_l) \in \mathcal{H}^l \text{ s.t. } \forall x \in \mathcal{X},\ h(x) = M(h_1(x), \ldots, h_l(x))\}$.

The distance of a binary code, denoted by $\delta(M)$ for $M \in \{\pm 1\}^{k \times l}$, is the minimal Hamming distance
between any two distinct rows of the code matrix. Formally, the Hamming distance between $u, v \in
\{-1, +1\}^l$ is $\Delta_h(u, v) = |\{r : u[r] \neq v[r]\}|$, and $\delta(M) = \min_{1 \le i < j \le k} \Delta_h(M[i], M[j])$. The ECOC
paradigm described in [Dietterich and Bakiri, 1995] proposes to choose a code with a large distance.
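The decoding rule and the code distance can be sketched in Python as follows. The helper names and the small example code `M_example` are illustrative assumptions; the bijection $\lambda$ is passed as a dictionary on labels and defaults to the identity, and ties in the argmax go to the smallest index.

```python
from itertools import combinations


def ecoc_decode(M, u, lam=None):
    """Return lam(argmax_i sum_j M[i][j] * u[j]) as a label in {1, ..., k}.

    M is a k x l code matrix given as a list of rows, u is a vector in
    {-1, +1}^l, and lam is an optional bijection on labels (a dict).
    """
    scores = [sum(m_ij * u_j for m_ij, u_j in zip(row, u)) for row in M]
    i = scores.index(max(scores)) + 1  # ties go to the smallest index
    return i if lam is None else lam[i]


def hamming(u, v):
    """Number of coordinates in which u and v differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def code_distance(M):
    """delta(M): minimal Hamming distance between two distinct rows of M."""
    return min(hamming(r, s) for r, s in combinations(M, 2))


# Hypothetical binary code with k = 4 rows and l = 4 columns.
M_example = [[1, 1, 1, 1],
             [1, -1, 1, -1],
             [1, 1, -1, -1],
             [1, -1, -1, 1]]
```

In this example every pair of rows differs in exactly two coordinates, so `code_distance(M_example)` is 2, and feeding the third row back in as `u` decodes to label 3 under the identity bijection.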

One vs. All (OvA) and All Pairs (AP): Let $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$ and $k \ge 2$. In the OvA method we solve
$k$ binary problems, each of which discriminates between one class and the rest of the classes. In the AP
approach all pairs of classes are compared to each other. This is formally defined as two ECOCs. Define
$M^{\mathrm{OvA}} \in \mathbb{R}^{k \times k}$ to be the matrix whose $(i, j)$ element is $1$ if $i = j$ and $-1$ if $i \neq j$. Then, the hypothesis
class of OvA is $\mathcal{H}_{\mathrm{OvA}} = \mathcal{H}_{M^{\mathrm{OvA}}}$. For the AP method, let $M^{\mathrm{AP}} \in \mathbb{R}^{k \times \binom{k}{2}}$ be such that for all $i \in [k]$ and
$1 \le j < l \le k$, the coordinate corresponding to row $i$ and column $(j, l)$ is defined to be $-1$ if $i = j$, $1$ if
$i = l$, and $0$ otherwise. Then, the hypothesis class of AP is $\mathcal{H}_{\mathrm{AP}} = \mathcal{H}_{M^{\mathrm{AP}}}$.
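The two code matrices can be built directly from their definitions. The sketch below uses 0-based row and column indices and orders the AP columns lexicographically by the pair $(j, l)$; both conventions, like the function names, are assumptions made for the illustration.

```python
from itertools import combinations


def ova_matrix(k):
    """M^OvA: k x k matrix with +1 on the diagonal and -1 elsewhere."""
    return [[1 if i == j else -1 for j in range(k)] for i in range(k)]


def ap_matrix(k):
    """M^AP: k x C(k, 2) matrix; column (j, l) with j < l holds
    -1 in row j, +1 in row l and 0 in every other row."""
    pairs = list(combinations(range(k), 2))  # columns in lexicographic order
    return [[-1 if i == j else (1 if i == l else 0) for (j, l) in pairs]
            for i in range(k)]
```

For $k = 3$ the AP matrix has the three columns $(1,2)$, $(1,3)$ and $(2,3)$, so the code length grows quadratically in $k$ while the OvA code length grows linearly.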

Our analysis of the estimation error is based on results that bound the sample complexity of multiclass 
learning. The sample complexity of an algorithm $A$ is the function $m_A$ defined as follows: For $\epsilon, \delta > 0$,
$m_A(\epsilon, \delta)$ is the smallest integer such that for every $m \ge m_A(\epsilon, \delta)$ and every distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$, with
probability at least $1 - \delta$ over the choice of an i.i.d. sample $S_m$ of size $m$,

$$\mathrm{Err}(A(S_m)) \le \min_{h \in \mathcal{H}} \mathrm{Err}(h) + \epsilon. \qquad (2)$$

The first term on the right-hand side is the approximation error of H. Therefore, the sample complexity is 
the number of examples required to ensure that the estimation error of $A$ is at most $\epsilon$ (with high probability).
We denote the sample complexity of a class $\mathcal{H}$ by $m_{\mathcal{H}}(\epsilon, \delta) = \inf_A m_A(\epsilon, \delta)$, where the infimum is taken
over all learning algorithms. 

To bound the sample complexity of a hypothesis class we rely on upper and lower bounds on the 
sample complexity in terms of two generalizations of the VC dimension for multiclass problems, called the 
Graph dimension and the Natarajan dimension and denoted $d_G(\mathcal{H})$ and $d_N(\mathcal{H})$. For completeness, these
dimensions are formally defined in the appendix. 

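Theorem 2.1 (Daniely et al. [2011]). For every hypothesis class $\mathcal{H}$, and for every ERM rule,

$$\Omega\left(\frac{d_N(\mathcal{H}) + \ln(1/\delta)}{\epsilon^2}\right) \le m_{\mathcal{H}}(\epsilon, \delta) \le m_{\mathrm{ERM}}(\epsilon, \delta) \le O\left(\frac{\min\{d_N(\mathcal{H}) \ln(|\mathcal{Y}|),\, d_G(\mathcal{H})\} + \ln(1/\delta)}{\epsilon^2}\right).$$

We note that the constants in the $O$, $\Omega$ notations are universal.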

2 The use of $\lambda$ here allows us to later consider codes with random association of rows to labels.
3 Main Results



In Section 3.1 we analyze the sample complexity of the different hypothesis classes. We provide lower 
bounds on the Natarajan dimensions of the various hypothesis classes, thus concluding, in light of Theorem 2.1,
a lower bound on the sample complexity of any algorithm. We also provide upper bounds on
the graph dimensions of these hypothesis classes, yielding, by the same theorem, an upper bound on the 
estimation error of ERM. In Section 3.2 we analyze the approximation error of the different hypothesis 
classes. 

3.1 Sample Complexity 

Together with Theorem 2.1, the following theorems estimate, up to logarithmic factors, the sample complexity
of the classes under consideration. We note that these theorems support the rule of thumb that the
Natarajan and Graph dimensions are of the same order as the number of parameters. The first theorem
shows that the sample complexity of MSVM depends on $\tilde{\Theta}(dk)$.

Theorem 3.1. $d(k - 1) \le d_N(\mathcal{L}) \le d_G(\mathcal{L}) \le O(dk \log(dk))$.

Next, we analyze the sample complexities of TC and ECOC. These methods rely on an underlying
hypothesis class of binary classifiers. While our main focus is the case in which the binary hypothesis class
is halfspaces over $\mathbb{R}^d$, the upper bounds on the sample complexity we derive below hold for any binary
hypothesis class of VC dimension $d + 1$.

Theorem 3.2. For every binary hypothesis class of VC dimension $d + 1$, and for any tree $T$, $d_G(\mathcal{H}_T) \le
d_G(\mathcal{H}_{\mathrm{trees}}) \le O(dk \log(dk))$. If the underlying hypothesis class is halfspaces over $\mathbb{R}^d$, then also

$$d(k - 1) \le d_N(\mathcal{H}_T) \le d_G(\mathcal{H}_T) \le d_G(\mathcal{H}_{\mathrm{trees}}) \le O(dk \log(dk)).$$

Theorems 3.1 and 3.2 improve results from Daniely et al. [2011] where it was shown that $\lfloor d/2 \rfloor \lfloor k/2 \rfloor \le
d_N(\mathcal{L}) \le O(dk \log(dk))$, and for every tree $d_G(\mathcal{H}_T) \le O(dk \log(dk))$. Further it was shown that if $\mathcal{H}$ is
the set of halfspaces over $\mathbb{R}^d$, then $\Omega(dk / \log(k)) \le d_N(\mathcal{H}_T)$.

We next turn to results for ECOC, and its special cases OvA and AP.

Theorem 3.3. For every $M \in \mathbb{R}^{k \times l}$ and every binary hypothesis class of VC dimension $d$, $d_G(\mathcal{H}_M) \le
O(dl \log(dl))$. Moreover, if $M \in \{\pm 1\}^{k \times l}$ and the underlying hypothesis class is halfspaces over $\mathbb{R}^d$, then

$$d \cdot \delta(M)/2 \le d_N(\mathcal{H}_M) \le d_G(\mathcal{H}_M) \le O(dl \log(dl)).$$

We note that if the code has a large distance, which is the case, for instance, in random codes, then $\delta(M) =
\Omega(l)$. In this case, the bound is tight up to logarithmic factors.

Theorem 3.4. For any binary hypothesis class of VC dimension $d$, $d_G(\mathcal{H}_{\mathrm{OvA}}) \le O(dk \log(dk))$ and
$d_G(\mathcal{H}_{\mathrm{AP}}) \le O(dk^2 \log(dk))$. If the underlying hypothesis class is halfspaces over $\mathbb{R}^d$ we also have:

$$d(k - 1) \le d_N(\mathcal{H}_{\mathrm{OvA}}) \le d_G(\mathcal{H}_{\mathrm{OvA}}) \le O(dk \log(dk)) \quad \text{and}$$
$$d \binom{k-1}{2} \le d_N(\mathcal{H}_{\mathrm{AP}}) \le d_G(\mathcal{H}_{\mathrm{AP}}) \le O(dk^2 \log(dk)).$$

3.2 Approximation error

We first show that the class $\mathcal{L}$ essentially contains $\mathcal{H}_{\mathrm{OvA}}$ and $\mathcal{H}_T$ for any tree $T$, assuming, of course, that
$\mathcal{H}$ is the class of halfspaces in $\mathbb{R}^d$. We find this result quite surprising, since the sample complexity of all
of these classes is of the same order.

Theorem 3.5. $\mathcal{L}$ essentially contains $\mathcal{H}_{\mathrm{trees}}$ and $\mathcal{H}_{\mathrm{OvA}}$. These inclusions are strict for $d \ge 2$ and $k \ge 3$.
One might suggest that a small increase in the dimension would perhaps allow us to embed $\mathcal{L}$ in $\mathcal{H}_T$ for
some tree $T$ or for OvA. The next result shows that this is not the case.

Theorem 3.6. Any embedding into a higher dimension that allows $\mathcal{H}_{\mathrm{OvA}}$ or $\mathcal{H}_T$ (for some tree $T$ for $k$
classes) to essentially contain $\mathcal{L}$, necessarily embeds into a dimension of at least $\tilde{\Omega}(dk)$.

The next theorem shows that the approximation error of AP is better than that of MSVM (and hence 
also better than OvA and TC). This is expected as the sample complexity of AP is considerably higher, and 
therefore we face the usual trade-off between approximation and estimation error. 

Theorem 3.7. $\mathcal{H}_{\mathrm{AP}}$ essentially contains $\mathcal{L}$. Moreover, there is a constant $k^* > 0$, independent of $d$, such
that the inclusion is strict for all k > k*. 

For a random ECOC of length o(k), it is easy to see that it does not contain MSVM, as MSVM has 
higher complexity. It is also not contained in MSVM, as it generates non-convex regions of labels. 

We next derive absolute lower bounds on the approximation errors of ECOC and TC when $d \ll k$.
Recall that both methods are built upon binary classifiers that should predict $h(x) = 1$ if the label of $x$ is in
$L$, for some $L \subset [k]$, and should predict $h(x) = -1$ if the label of $x$ is not in $L$. As the following lemma
shows, when the partition of $[k]$ into the two sets $L$ and $[k] \setminus L$ is arbitrary and balanced, and $k \gg d$, such
binary classifiers will almost always perform very poorly.

Lemma 3.8. There exists a constant $C > 0$ for which the following holds. Let $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$ be any
hypothesis class of VC-dimension $d$, let $\mu \in (0, 1/2]$, and let $\mathcal{D}$ be any distribution over $\mathcal{X} \times [k]$ such that
$\forall i\ \mathbb{P}_{(x,y) \sim \mathcal{D}}(y = i) \le \frac{10}{k}$. Let $\phi : [k] \to \{\pm 1\}$ be a randomly chosen function which is sampled according
to one of the following rules: (1) For each $i \in [k]$, each coordinate $\phi(i)$ is chosen independently from the
other coordinates and $\mathbb{P}(\phi(i) = -1) = \mu$; or (2) $\phi$ is chosen uniformly among all functions satisfying
$|\{i \in [k] : \phi(i) = -1\}| = \mu k$.

Let $\mathcal{D}_\phi$ be the distribution over $\mathcal{X} \times \{\pm 1\}$ obtained by drawing $(x, y)$ according to $\mathcal{D}$ and replacing it
with $(x, \phi(y))$. Then, for any $\nu > 0$, if $k \ge C \cdot \frac{d + \ln(1/\delta)}{\nu^2}$, then with probability of at least $1 - \delta$ over the
choice of $\phi$, the approximation error of $\mathcal{H}$ with respect to $\mathcal{D}_\phi$ will be at least $\mu - \nu$.
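The phenomenon described by the lemma can be observed in a toy simulation, sketched below in Python. This is not part of the paper's proof: it assumes that class $i$ is concentrated on the point $i$ of the real line with probability $1/k$, that the binary class consists of one-dimensional threshold rules of both orientations (a class of VC dimension 2), and that $\phi$ is drawn by rule (1) with $\mu = 1/2$. All names are illustrative.

```python
import random


def best_threshold_error(phi):
    """Smallest error of a threshold rule on the points 1..k with labels phi.

    A rule predicts s on positions > t and -s on positions <= t,
    for t in {0, ..., k} and s in {-1, +1}.
    """
    k = len(phi)
    # Mistakes of the rule with s = +1 and t = 0, which predicts +1 everywhere.
    mistakes = sum(1 for y in phi if y == -1)
    best = min(mistakes, k - mistakes)
    for y in phi:
        # Moving the threshold past a point flips the prediction there to -1.
        mistakes += 1 if y == 1 else -1
        best = min(best, mistakes, k - mistakes)  # k - mistakes is the rule with s = -1
    return best / k


def random_partition(k, rng):
    """Rule (1) of the lemma with mu = 1/2: an independent fair coin per label."""
    return [-1 if rng.random() < 0.5 else 1 for _ in range(k)]


rng = random.Random(0)
for k in (10, 100, 1000, 10000):
    print(k, best_threshold_error(random_partition(k, rng)))
```

Because $d$ is fixed while $k$ grows, one would expect the printed errors to move toward $1/2$, in line with the bound $\mu - \nu$ of the lemma.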

As the corollaries below show, Lemma 3.8 entails that when $k \gg d$, both random ECOCs with a small
code length, and balanced trees with a random labeling of the leaves, are expected to perform very poorly. 

Corollary 3.9. There is a constant $C > 0$ for which the following holds. Let $(T, \lambda)$ be a tree for $k$
classes such that $\lambda : \mathrm{leaf}(T) \to [k]$ is chosen uniformly at random. Denote by $k_L$ and $k_R$ the number of
leaves of the left and right sub-trees (respectively) that descend from the root, and let $\mu = \min\{\frac{k_L}{k}, \frac{k_R}{k}\}$. Let
$\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$ be a hypothesis class of VC-dimension $d$, let $\nu > 0$, and let $\mathcal{D}$ be any distribution over $\mathcal{X} \times [k]$
such that $\forall i\ \mathbb{P}_{(x,y) \sim \mathcal{D}}(y = i) \le \frac{10}{k}$. Then, for $k \ge C \cdot \frac{d + \ln(1/\delta)}{\nu^2}$, with probability of at least $1 - \delta$ over
the choice of $\lambda$, the approximation error of $\mathcal{H}_T$ with respect to $\mathcal{D}$ is at least $\mu - \nu$.

Corollary 3.10. There is a constant $C > 0$ for which the following holds. Let $(M, \lambda)$ be an ECOC where
$M \in \mathbb{R}^{k \times l}$, and assume that the bijection $\lambda : [k] \to [k]$ is chosen uniformly at random. Let $\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$
be a hypothesis class of VC-dimension $d$, let $\nu > 0$, and let $\mathcal{D}$ be any distribution over $\mathcal{X} \times [k]$ such that
$\forall i\ \mathbb{P}_{(x,y) \sim \mathcal{D}}(y = i) \le \frac{10}{k}$. Then, for $k \ge C \cdot \frac{dl \log(dl) + \ln(1/\delta)}{\nu^2}$, with probability of at least $1 - \delta$ over the
choice of $\lambda$, the approximation error of $\mathcal{H}_M$ with respect to $\mathcal{D}$ is at least $1/2 - \nu$.

Note that the first corollary holds even if only the top level of the binary tree is balanced and splits the 
labels randomly to the left and the right sub-trees. The second corollary holds even if the code itself is not 
random (nor does it have to be binary), and only the association of rows with labels is random. In particular, 
if the length of the code is $O(\log(k))$, as suggested in Allwein et al. [2000], and the number of classes is
$\tilde{\Omega}(d)$, then the code is expected to perform poorly.






For an ECOC with a matrix of length $\Omega(k)$ and $d = o(k)$, we do not have such a negative result as
stated in Corollary 3.10. Nonetheless, Lemma 3.8 implies that the prediction of the binary classifiers when
$d = o(k)$ is just slightly better than a random guess, thus it seems to indicate that the ECOC method will
still perform poorly. Moreover, most current theoretical analyses of ECOC estimate the error of the learned
multiclass hypothesis in terms of the average error of the binary classifiers. Alas, when the number of
classes is large, Lemma 3.8 shows that this average will be close to $1/2$.

Finally, let us briefly discuss the tightness of Lemma 3.8. Let $x_1, \ldots, x_{d+1} \in \mathbb{R}^d$ be affinely independent
and let $\mathcal{D}$ be the distribution over $\mathbb{R}^d \times [d + 1]$ defined by $\mathbb{P}_{(x,y) \sim \mathcal{D}}((x, y) = (x_i, i)) = \frac{1}{d+1}$. It is
not hard to see that for every $\phi : [d + 1] \to \{\pm 1\}$, the approximation error of the class of halfspaces with
respect to $\mathcal{D}_\phi$ is zero. Thus, in order to ensure a large approximation error for every distribution, the number
of classes must be at least linear in the dimension, so in this sense, the lemma is tight. Yet, this example is 
very simple, since each class is concentrated on a single point and the points are linearly independent. It is 
possible that in real-world distributions, a large approximation error will be exhibited even when k < d. 
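The zero-error claim in this example can be checked explicitly for one convenient choice of affinely independent points, namely the standard basis vectors $e_1, \ldots, e_d$ together with the origin. The Python sketch below is a sanity check under that assumption only; the closed-form weights and the function names are our own illustration.

```python
from itertools import product


def halfspace(w, x):
    """sign(<w, (x, 1)>) with sign(0) = 1."""
    score = sum(wi * xi for wi, xi in zip(w, list(x) + [1.0]))
    return 1 if score >= 0 else -1


def simplex_points(d):
    """The affinely independent points e_1, ..., e_d and the origin in R^d."""
    pts = [[1.0 if j == i else 0.0 for j in range(d)] for i in range(d)]
    pts.append([0.0] * d)
    return pts


def separating_weights(phi):
    """Weights w with <w, (x_i, 1)> = phi[i] on the simplex points."""
    bias = phi[-1]  # the origin only sees the bias term
    return [p - bias for p in phi[:-1]] + [bias]


def all_labelings_realizable(d):
    """Check that every phi in {-1, +1}^(d+1) is realized by some halfspace."""
    pts = simplex_points(d)
    for phi in product((-1, 1), repeat=d + 1):
        w = separating_weights(phi)
        if any(halfspace(w, x) != y for x, y in zip(pts, phi)):
            return False
    return True
```

On the point $e_i$ the score is $(\phi(i) - \phi(d+1)) + \phi(d+1) = \phi(i)$, and on the origin it is $\phi(d+1)$, so every one of the $2^{d+1}$ labelings is realized with zero error.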

We note that the phenomenon of a large approximation error, described in Corollaries 3.9 and 3.10, does
not occur in the classes $\mathcal{L}$, $\mathcal{H}_{\mathrm{OvA}}$ and $\mathcal{H}_{\mathrm{AP}}$, since these classes are symmetric.

4 Proof Techniques 

Due to lack of space, the proofs for all the results stated above are provided in the appendix. In this section 
we give a brief description of our main proof techniques. 

Most of our proofs for the estimation error results, stated in Section 3.1, are based on a similar method
which we now describe. Let $L : \{\pm 1\}^l \to [k]$ be a multiclass-to-binary reduction (e.g., a tree), and for
$\mathcal{H} \subseteq \{\pm 1\}^{\mathcal{X}}$, denote $L(\mathcal{H}) = \{x \mapsto L(h_1(x), \ldots, h_l(x)) \mid h_1, \ldots, h_l \in \mathcal{H}\}$. Our upper bounds for
$d_G(L(\mathcal{H}))$ are mostly based on the following simple lemma.

Lemma 4.1. If $\mathrm{VC}(\mathcal{H}) = d$ then $d_G(L(\mathcal{H})) = O(ld \ln(ld))$.

The technique for the lower bound on $d_N(L(\mathcal{H}))$ when $\mathcal{H}$ is the class of halfspaces in $\mathbb{R}^d$ is more
involved, and quite general. We consider a binary hypothesis class $\mathcal{G} \subseteq \{\pm 1\}^{[d] \times [l]}$ which consists of
functions having an arbitrary behaviour over $[d] \times \{i\}$, and a very uniform behaviour on other inputs (such
as mapping all other inputs to a constant). We show that $L(\mathcal{G})$ N-shatters the set $[d] \times [l]$. Since $\mathcal{G}$ is quite
simple, this is usually not very hard to show. Finally, we show that the class of halfspaces is richer than $\mathcal{G}$,
in the sense that the inputs to $\mathcal{G}$ can be mapped to points in $\mathbb{R}^d$ such that the functions of $\mathcal{G}$ can be mapped
to halfspaces. We conclude that $d_N(L(\mathcal{H})) \ge d_N(L(\mathcal{G}))$.

To prove the approximation error lower bounds stated in Section 3.2, we use the techniques of VC theory 
in an unconventional way. The idea of this proof is as follows: Using a uniform convergence argument based 
on the VC dimension of the binary hypothesis class, we show that there exists a small labeled sample S 
whose approximation error for the hypothesis class is close to the approximation error for the distribution, 
for all possible label mappings. This allows us to restrict our attention to a finite set of hypotheses, by 
their restriction to the sample. For these hypotheses, we show that with high probability over the choice of 
label mapping, the approximation error on the sample is high. A union bound on the finite set of possible 
hypotheses shows that the approximation error on the distribution will be high, with high probability over 
the choice of the label mapping. 

5 Implications 

The first immediate implication of our results is that whenever the number of examples in the training set is
$\tilde{\Omega}(dk)$, MSVM should be preferred to OvA and TC. This is certainly true if the hypothesis class of MSVM,
$\mathcal{L}$, has a zero approximation error (the realizable case), since the ERM is then solvable with respect to $\mathcal{L}$.
Note that since the inclusions given in Theorem 3.5 are strict, there are cases where the data is realizable
with MSVM but not with $\mathcal{H}_{\mathrm{OvA}}$ or with respect to any tree.

In the non-realizable case, implementing the ERM is intractable for all of these methods. Nonetheless, 
for each method there are reasonable heuristics to approximate the ERM, which should work well when 
the approximation error is small. Therefore, we believe that MSVM should be the method of choice in this 
case as well due to its lower approximation error. However, variations in the optimality of algorithms for 
different hypothesis classes should also be taken into account in this analysis. We leave this detailed analysis 
of specific training heuristics for future work. Our analysis also implies that it is inadvisable
to use TC with a randomly selected $\lambda$ or ECOC with a random code whenever $k > d$. Finally, when the
number of examples is much larger than $dk^2$, the analysis implies that it is better to choose the AP approach.

To conclude this section, we illustrate the relative performance of MSVM, OvA, TC, and ECOC, by 
considering the simplistic case where $d = 2$, and each class is concentrated on a single point in $\mathbb{R}^2$. In the
leftmost graph below, there are two classes in $\mathbb{R}^2$, and the approximation error of all algorithms is zero. In
the middle graph, there are 9 classes ordered on the unit circle of $\mathbb{R}^2$. Here, both MSVM and OvA have a
zero approximation error, but the error of TC and of ECOC with a random code will most likely be large. In
the rightmost graph, we chose random points in $\mathbb{R}^2$. MSVM still has a zero approximation error. However,
OvA cannot learn the binary problem of distinguishing between the middle point and the rest of the points 
and hence has a larger approximation error. 




          Left   Middle   Right
MSVM      ✓      ✓        ✓
OvA       ✓      ✓        ✗
TC/ECOC   ✓      ✗        ✗
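The middle configuration can be verified numerically. The Python sketch below places 9 classes at evenly spaced points on the unit circle and checks that both an MSVM hypothesis and a set of one-vs-all halfspaces label every point correctly. The particular weights (row $i$ of $W$ equal to $(x_i, 0)$, and the common threshold $c$ for OvA) are our own illustrative choices, not constructions taken from the paper.

```python
import math


def circle_points(k):
    """k points evenly spaced on the unit circle; point i carries label i + 1."""
    return [(math.cos(2 * math.pi * i / k), math.sin(2 * math.pi * i / k))
            for i in range(k)]


def msvm_labels(points):
    """Predict with W whose i-th row is (x_i, 0): argmax_i <x_i, x>."""
    labels = []
    for x in points:
        scores = [p[0] * x[0] + p[1] * x[1] for p in points]
        labels.append(scores.index(max(scores)) + 1)
    return labels


def ova_outputs(points, x):
    """Binary outputs h_j(x) = sign(<x_j, x> - c), one halfspace per class."""
    k = len(points)
    c = (1.0 + math.cos(2 * math.pi / k)) / 2.0  # strictly between cos(2*pi/k) and 1
    return [1 if p[0] * x[0] + p[1] * x[1] - c >= 0 else -1 for p in points]


points = circle_points(9)
msvm_ok = msvm_labels(points) == list(range(1, 10))
# OvA succeeds if, on each point, exactly the classifier of its own class fires.
ova_ok = all(ova_outputs(points, x) == [1 if j == i else -1 for j in range(9)]
             for i, x in enumerate(points))
print(msvm_ok, ova_ok)
```

Both checks rely on the fact that $\langle x_i, x_i \rangle = 1$ while $\langle x_i, x_j \rangle \le \cos(2\pi/9)$ for $j \neq i$, so each point on the circle can be cut off from the others by a single halfspace.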
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A Proofs 

A.1 Notation and Definitions

Throughout the proofs, we fix $d, k \ge 2$. We denote by $\mathcal{W} = \mathcal{W}^d = \{h_w : w \in \mathbb{R}^{d+1}\}$ the class of linear
separators (with bias) over $\mathbb{R}^d$. We assume the following "tie breaking" conventions:

• For $f : [k] \to \mathbb{R}$, $\arg\max_{i\in[k]} f(i)$ is the minimal number $i_0 \in [k]$ for which $f(i_0) = \max_{i\in[k]} f(i)$;

• $\mathrm{sign}(0) = 1$.

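Given a hypothesis class $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, denote its restriction to $A \subseteq \mathcal{X}$ by $\mathcal{H}|_A = \{f|_A : f \in \mathcal{H}\}$. Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$
be a hypothesis class and let $\phi : \mathcal{Y} \to \mathcal{Y}'$, $\iota : \mathcal{X}' \to \mathcal{X}$ be functions. Denote $\phi \circ \mathcal{H} = \{\phi \circ h : h \in \mathcal{H}\}$ and
$\mathcal{H} \circ \iota = \{h \circ \iota : h \in \mathcal{H}\}$.

Given $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ and a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, denote the approximation error by $\mathrm{Err}^*_{\mathcal{D}}(\mathcal{H}) =
\inf_{h\in\mathcal{H}} \mathrm{Err}_{\mathcal{D}}(h)$. Recall that by Definition 1.1, $\mathcal{H}$ essentially contains $\mathcal{H}' \subseteq \mathcal{Y}^{\mathcal{X}}$ if and only if $\mathrm{Err}^*_{\mathcal{D}}(\mathcal{H}) \le
\mathrm{Err}^*_{\mathcal{D}}(\mathcal{H}')$ for every distribution $\mathcal{D}$. For a binary hypothesis class $\mathcal{H}$, denote its VC dimension by $\mathrm{VC}(\mathcal{H})$.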

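Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$ be a hypothesis class and let $S \subseteq \mathcal{X}$. We say that $\mathcal{H}$ G-shatters $S$ if there exists an
$f : S \to \mathcal{Y}$ such that for every $T \subseteq S$ there is a $g \in \mathcal{H}$ such that

$$\forall x \in T,\ g(x) = f(x), \quad\text{and}\quad \forall x \in S \setminus T,\ g(x) \ne f(x).$$

We say that $\mathcal{H}$ N-shatters $S$ if there exist $f_1, f_2 : S \to \mathcal{Y}$ such that $\forall y \in S$, $f_1(y) \ne f_2(y)$, and for every
$T \subseteq S$ there is a $g \in \mathcal{H}$ such that

$$\forall x \in T,\ g(x) = f_1(x), \quad\text{and}\quad \forall x \in S \setminus T,\ g(x) = f_2(x).$$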
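
For a finite hypothesis class and a small set, the two shattering notions can be checked by brute force. The following is a minimal sketch, not part of the original analysis: hypotheses and witnesses are encoded as Python dicts mapping points to labels, and the function names `g_shatters` and `n_shatters` are illustrative.

```python
from itertools import combinations


def subsets(points):
    """Yield every subset T of the given points as a Python set."""
    points = list(points)
    for size in range(len(points) + 1):
        for chosen in combinations(points, size):
            yield set(chosen)


def g_shatters(hyps, S, f):
    """G-shattering with witness f: for every T there is a hypothesis
    that equals f exactly on T and differs from f on S minus T."""
    return all(
        any(all((h[x] == f[x]) == (x in T) for x in S) for h in hyps)
        for T in subsets(S)
    )


def n_shatters(hyps, S, f1, f2):
    """N-shattering with witnesses f1, f2: they must disagree everywhere on S,
    and for every T some hypothesis equals f1 on T and f2 on S minus T."""
    if any(f1[x] == f2[x] for x in S):
        return False
    return all(
        any(all(h[x] == (f1[x] if x in T else f2[x]) for x in S) for h in hyps)
        for T in subsets(S)
    )


if __name__ == "__main__":
    S = [0, 1]
    hyps = [{0: 0, 1: 0}, {0: 0, 1: 1}, {0: 1, 1: 0}, {0: 2, 1: 2}]
    zero = {0: 0, 1: 0}
    one = {0: 1, 1: 1}
    print(g_shatters(hyps, S, zero), n_shatters(hyps, S, zero, one))
```

The small class in the usage block is G-shattered with the all-zero witness but is not N-shattered by that particular pair of witnesses, which illustrates why the Natarajan condition is the more demanding one.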

The graph dimension of $\mathcal{H}$, denoted $d_G(\mathcal{H})$, is the maximal cardinality of a set that is G-shattered by $\mathcal{H}$.
The Natarajan dimension of $\mathcal{H}$, denoted $d_N(\mathcal{H})$, is the maximal cardinality of a set that is N-shattered by
$\mathcal{H}$. Both of these dimensions coincide with the VC-dimension for $|\mathcal{Y}| = 2$. Note also that we always have
$d_N(\mathcal{H}) \le d_G(\mathcal{H})$. As shown in Ben-David et al. [1995], it also holds that $d_G(\mathcal{H}) \le 4.67 \log_2(|\mathcal{Y}|)\, d_N(\mathcal{H})$.

Proof of Lemma 4.1. Let $A \subseteq \mathcal{X}$ be a G-shattered set with $|A| = d_G(L(\mathcal{H}))$. By Sauer's Lemma, $2^{|A|} \le
|\mathcal{H}|_A|^l \le |A|^{dl}$, thus $d_G(L(\mathcal{H})) = |A| = O(ld \log(ld))$. □
A.2 Multiclass SVM

Proof of Theorem 3.1. The lower bound follows from Theorems 3.5 and 3.2. To upper bound $d_G := d_G(\mathcal{L})$,
let $S = \{x_1, \dots, x_{d_G}\} \subseteq \mathbb{R}^d$ be a set which is G-shattered by $\mathcal{L}$, and let $f : S \to [k]$ be a function that
witnesses the shattering. For $x \in \mathbb{R}^d$ and $j \in [k]$, denote

$$\phi(x, j) = (0, \dots, 0, x[1], \dots, x[d], 1, 0, \dots, 0) \in \mathbb{R}^{(d+1)k},$$

where $x[1]$ is in coordinate $(d+1)(j-1)+1$. For every $(i, j) \in [d_G] \times [k]$, define $z_{i,j} = \phi(x_i, f(x_i)) -
\phi(x_i, j)$. Denote $Z = \{z_{i,j} \mid (i,j) \in [d_G] \times [k]\}$. Since $\mathrm{VC}(\mathcal{W}^{(d+1)k}) = (d+1)k + 1$, by Sauer's lemma,

$$\left|\mathcal{W}^{(d+1)k}|_Z\right| \le |Z|^{(d+1)k+1} = (d_G k)^{(d+1)k+1}.$$

We now show that there is a one-to-one mapping from subsets of $S$ to $\mathcal{W}^{(d+1)k}|_Z$, thus concluding an upper
bound on the size of $S$. For any $T \subseteq S$, choose $W(T) \in M_{k\times(d+1)}(\mathbb{R})$ such that

$$\{x \in S \mid h_{W(T)}(x) = f(x)\} = T.$$

Such a $W(T)$ exists because of the G-shattering of $S$ by $\mathcal{L}$ using the witness $f$. Define the vector $w(T) \in
\mathbb{R}^{k(d+1)}$ which is the concatenation of the rows of $W(T)$, that is

$$w(T) = \left(W(T)_{(1,1)}, \dots, W(T)_{(1,d+1)}, \dots, W(T)_{(k,1)}, \dots, W(T)_{(k,d+1)}\right).$$

Now, suppose that $T_1 \ne T_2$ for $T_1, T_2 \subseteq S$. We now show that $w(T_1)|_Z \ne w(T_2)|_Z$. Suppose w.l.o.g.
that there is some $x_i \in T_1 \setminus T_2$. Thus, $f(x_i) = h_{W(T_1)}(x_i) \ne h_{W(T_2)}(x_i) =: j$. It follows that the inner
product of $\bar{x}_i$ with row $f(x_i)$ of $W(T_1)$ is greater than the inner product of $\bar{x}_i$ with row $j$ of $W(T_1)$, while
for $W(T_2)$, the situation is reversed. Therefore, $\mathrm{sign}(\langle w(T_1), z_{i,j}\rangle) \ne \mathrm{sign}(\langle w(T_2), z_{i,j}\rangle)$, so $w(T_1)$ and
$w(T_2)$ induce different labelings of $Z$. It follows that the number of subsets of $S$ is bounded by the size of
$\mathcal{W}^{(d+1)k}|_Z$, thus $2^{d_G} \le (k d_G)^{(d+1)k+1}$. We conclude that $d_G \le O(dk \log(dk))$. □
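
The embedding used in this proof can be checked numerically. The sketch below is only an illustration under stated assumptions: it uses 0-indexed blocks instead of the 1-indexed coordinates of the text, and the helper names `phi` and `msvm_predict` are hypothetical. It confirms that the inner product of the concatenated rows of a matrix with the embedded point recovers the score of the corresponding row, so a difference of two embedded points compares two rows.

```python
import numpy as np


def phi(x, j, k):
    """Put (x, 1) into block j (0-indexed) of a zero vector of length (d+1)*k."""
    d = len(x)
    out = np.zeros((d + 1) * k)
    out[(d + 1) * j:(d + 1) * (j + 1)] = np.append(x, 1.0)
    return out


def msvm_predict(W, x):
    """Multiclass linear prediction: index of the largest row score on (x, 1).
    np.argmax returns the first maximizer, i.e. the minimal index on ties."""
    return int(np.argmax(W @ np.append(x, 1.0)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    k, d = 4, 2
    W = rng.normal(size=(k, d + 1))
    x = rng.normal(size=d)
    w = W.reshape(-1)  # concatenation of the rows of W
    scores = W @ np.append(x, 1.0)
    for j in range(k):
        print(j, np.isclose(w @ phi(x, j, k), scores[j]))
```

With this identity, the sign of the inner product of the concatenated vector with a difference of embedded points records which of two rows scores higher, which is how the proof turns a multiclass labeling of $S$ into a binary labeling of $Z$.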

A.3 Simple classes that can be represented by the class of linear separators 

In this section we define two fairly simple hypothesis classes, and show that the class of linear separators is 
richer than them. We will later use this observation to prove lower bounds on the Natarajan dimension of 
various multiclass hypothesis classes. 

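Let $l \ge 2$. For $f \in \{-1,1\}^{[d]}$, $i \in [l]$, $j \in \{-1, 1\}$ define $f^{i,j} : [d] \times [l] \to \{-1,1\}$ by

$$f^{i,j}(u,v) = \begin{cases} f(u) & v = i \\ j & \text{otherwise.} \end{cases}$$

And define the hypothesis class $\mathcal{F}^l$ as

$$\mathcal{F}^l = \{f^{i,j} : f \in \{-1,1\}^{[d]},\ i \in [l],\ j \in \{-1,1\}\}.$$

For $g \in \{-1, 1\}^{[d]}$, $i \in [l]$, $j \in \{\pm 1\}$ define $g^{i,j} : [d] \times [l] \to \{-1, 1\}$ by

$$g^{i,j}(u,v) = \begin{cases} g(u) & v = i \\ j & v > i \\ -j & v < i. \end{cases}$$

And define the hypothesis class $\mathcal{G}^l$ as

$$\mathcal{G}^l = \{g^{i,j} : g \in \{-1,1\}^{[d]},\ i \in [l],\ j \in \{\pm 1\}\}.$$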
Let $\mathcal{H} \subseteq \mathcal{Y}^{\mathcal{X}}$, $\mathcal{H}' \subseteq \mathcal{Y}^{\mathcal{X}'}$ be two hypothesis classes. We say that $\mathcal{H}$ is richer than $\mathcal{H}'$ if there is a
mapping $\iota : \mathcal{X}' \to \mathcal{X}$ such that $\mathcal{H}' = \mathcal{H} \circ \iota$. It is clear that if $\mathcal{H}$ is richer than $\mathcal{H}'$ then $d_N(\mathcal{H}') \le d_N(\mathcal{H})$
and $d_G(\mathcal{H}') \le d_G(\mathcal{H})$. Thus, the notion of richness can be used to establish lower and upper bounds on
the Natarajan and Graph dimension, respectively. The following lemma shows that $\mathcal{W}$ is richer than $\mathcal{F}^l$ and
$\mathcal{G}^l$ for every $l$. This will allow us to use the classes $\mathcal{F}^l$, $\mathcal{G}^l$ instead of $\mathcal{W}$ when bounding from below the
dimension of an ECOC or TC hypothesis class in which the binary classifiers are from $\mathcal{W}$.

Lemma A.1. For any integer $l \ge 2$, $\mathcal{W}$ is richer than $\mathcal{F}^l$ and $\mathcal{G}^l$.

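Proof. We shall first prove that $\mathcal{W}$ is richer than $\mathcal{F}^l$. Choose $l$ unit vectors $e_1, \dots, e_l \in \mathbb{R}^d$. For every
$i \in [l]$, choose $d$ affinely independent vectors such that

$$x_{1,i}, \dots, x_{d,i} \in \{x \in \mathbb{R}^d : \langle x, e_i\rangle = 1,\ \forall i' \ne i,\ \langle x, e_{i'}\rangle < 1\}.$$

This can be done by choosing $d$ affinely independent vectors in $\{x \in \mathbb{R}^d : \langle x, e_i\rangle = 1\}$ that are very close
to $e_i$. Define $\iota(m,i) = x_{m,i}$. Now fix $i \in [l]$ and $j \in \{-1,+1\}$, and let $f^{i,j} \in \mathcal{F}^l$. We must show
that $f^{i,j} = h \circ \iota$ for some $h \in \mathcal{W}$. We will show that there exists an affine map $\Lambda : \mathbb{R}^d \to \mathbb{R}$ for which
$f^{i,j} = \mathrm{sign} \circ \Lambda \circ \iota$. This suffices, since $\mathcal{W}$ is exactly the set of all functions of the form $\mathrm{sign} \circ \Lambda$ where $\Lambda$ is
an affine map. Define $M = \{x \in \mathbb{R}^d : \langle x, e_i\rangle = 1\}$, and let $A : M \to \mathbb{R}$ be the affine map defined by

$$\forall m \in [d],\ A(x_{m,i}) = f^{i,j}(m,i).$$

Let $P : \mathbb{R}^d \to M$ be the orthogonal projection of $\mathbb{R}^d$ on $M$. For $a \in \mathbb{R}$, define an affine map $\Lambda_a : \mathbb{R}^d \to \mathbb{R}$
by

$$\Lambda_a(x) = A(P(x)) + a \cdot \langle x - e_i, e_i\rangle.$$

Note that $\forall m \in [d]$, $\Lambda_a(x_{m,i}) = f^{i,j}(m,i)$. Moreover, for every $i' \ne i$ and $m \in [d]$ we have $\langle x_{m,i'} -
e_i, e_i\rangle < 0$. Thus, by choosing $|a|$ sufficiently large and choosing $\mathrm{sign}(a)$ depending on $j$, we can make
sure that $f^{i,j} = \mathrm{sign} \circ \Lambda_a \circ \iota$.

The proof that $\mathcal{W}$ is richer than $\mathcal{G}^l$ is similar and simpler. Let $e_1, \dots, e_d \in \mathbb{R}^{d-1}$ be affinely independent.
Define

$$\iota(m, i) = (e_m, i) \in \mathbb{R}^{d-1} \times \mathbb{R} = \mathbb{R}^d.$$

Given $g^{i,j} \in \mathcal{G}^l$, let $A : \mathbb{R}^{d-1} \times \{i\} \to \mathbb{R}$ be the affine map defined by $A(e_m, i) = g^{i,j}(m, i)$ and let
$P : \mathbb{R}^d \to \mathbb{R}^{d-1} \times \{i\}$ be the orthogonal projection. Define $\Lambda : \mathbb{R}^d \to \mathbb{R}$ by

$$\Lambda(x, y) = A(P(x, y)) + j \cdot 10 \cdot (y - i).$$

It is easy to check that $\mathrm{sign} \circ \Lambda \circ \iota = g^{i,j}$. □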

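Note A.2. From Lemma A.1 it follows that $\mathrm{VC}(\mathcal{F}^l), \mathrm{VC}(\mathcal{G}^l) \le d + 1$. On the other hand, both $\mathcal{F}^l$ and $\mathcal{G}^l$
shatter $([d] \times \{1\}) \cup \{(1,2)\}$. Thus, $\mathrm{VC}(\mathcal{F}^l) = \mathrm{VC}(\mathcal{G}^l) = d + 1$.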

A.4 Trees 

Proof of Theorem 3.2. We first prove the upper bound. Let $A \subseteq \mathcal{X}$ be a G-shattered set with $|A| =
d_G(\mathcal{H}_{\mathrm{trees}})$. By Sauer's Lemma, and since the number of trees is bounded by $k^k$, we have

$$2^{|A|} \le k^k \cdot |\mathcal{H}|_A|^k \le k^k \cdot |A|^{dk},$$

thus $d_G(\mathcal{H}_{\mathrm{trees}}) = |A| = O(dk \log(dk))$.

To prove the lower bound, by Lemma A.1, it is enough to show that $d_N(\mathcal{G}^l_T) \ge d \cdot (k-1)$ for some $l$.
We will take $l = |N(T)| = k - 1$. Linearly order $N(T)$ such that for every node $v$, the nodes in the left
sub-tree emanating from $v$ are smaller than the nodes in the corresponding right sub-tree. We will identify
$[l]$ with $N(T)$ by an order-preserving map, thus $\mathcal{G}^l \subseteq \{-1, 1\}^{[d] \times N(T)}$. We also identify the labels with the
leaves.

Define $g_1 : [d] \times N(T) \to \mathrm{leaf}(T)$ by setting $g_1(i, v)$ to be the leaf obtained by starting from the node
$v$, going right once and then going left until reaching a leaf. Similarly, define $g_2 : [d] \times N(T) \to \mathrm{leaf}(T)$
by setting $g_2(i, v)$ to be the leaf obtained by starting from the node $v$, going left once and then going right
until reaching a leaf.

We shall show that $g_1, g_2$ witness the N-shattering of $[d] \times N(T)$ by $\mathcal{G}^l_T$. Given $S \subseteq [d] \times N(T)$ define
$C : N(T) \to \mathcal{G}^l$ by

$$C(v)(i,u) = \begin{cases} -1 & u < v \\ 1 & u > v \\ 1 & u = v,\ (i,u) \in S \\ -1 & u = v,\ (i,u) \notin S. \end{cases}$$

It is not hard to check that $\forall (i,u) \in S$, $h_C(i,u) = g_1(i,u)$, and $\forall (i,u) \notin S$, $h_C(i,u) = g_2(i,u)$. □
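
The combinatorial part of this lower bound can be verified exhaustively on a small tree. The sketch below is an illustration only. It assumes a hypothetical encoding in which a tree is a nested pair `(left, right)` with integer labels at the leaves, it numbers internal nodes in-order (so that left sub-trees precede their parent and right sub-trees follow it), and it routes a point $(i,u)$ to the right at node $v$ exactly when $C(v)(i,u) = 1$.

```python
from itertools import product


def number_inorder(tree, next_id=0):
    """Give each internal node an in-order index; leaves keep their labels."""
    if not isinstance(tree, tuple):
        return tree, next_id
    left, next_id = number_inorder(tree[0], next_id)
    my_id = next_id
    right, next_id = number_inorder(tree[1], my_id + 1)
    return {"id": my_id, "left": left, "right": right}, next_id


def descend(node, go_right):
    """Walk down from node, going right whenever go_right(node id) is true."""
    while isinstance(node, dict):
        node = node["right"] if go_right(node["id"]) else node["left"]
    return node


def find(node, target):
    """Locate the internal node with the given in-order index."""
    while node["id"] != target:
        node = node["right"] if target > node["id"] else node["left"]
    return node


def g1(root, u):
    """Leaf reached from node u by going right once, then left."""
    return descend(find(root, u)["right"], lambda _: False)


def g2(root, u):
    """Leaf reached from node u by going left once, then right."""
    return descend(find(root, u)["left"], lambda _: True)


def h_C(root, S, i, u):
    """Tree prediction when node v holds C(v): right iff C(v)(i, u) = 1."""
    return descend(root, lambda v: u > v or (u == v and (i, u) in S))


def check_n_shattering(tree, d):
    """Check that g1, g2 witness N-shattering of [d] x N(T) for every subset S."""
    root, n_nodes = number_inorder(tree)
    points = [(i, u) for i in range(d) for u in range(n_nodes)]
    if any(g1(root, u) == g2(root, u) for _, u in points):
        return False
    for mask in product([False, True], repeat=len(points)):
        S = {p for p, keep in zip(points, mask) if keep}
        for (i, u) in points:
            want = g1(root, u) if (i, u) in S else g2(root, u)
            if h_C(root, S, i, u) != want:
                return False
    return True


if __name__ == "__main__":
    print(check_n_shattering(((0, 1), (2, 3)), d=2))
```

The check enumerates all subsets of the $d(k-1)$ points, so it is only practical for very small $d$ and $k$; it exercises the routing argument, not the realization of the node functions by linear separators, which is the content of Lemma A.1.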

Note A.3. Define $\bar{\mathcal{G}}^l = \{g^{i,1} : g \in \{-1,1\}^{[d]},\ i \in [l]\}$. The proof shows that $d_N(\bar{\mathcal{G}}^l_T) \ge d \cdot (k-1)$. Since
$\mathrm{VC}(\bar{\mathcal{G}}^l) = d$, we obtain a simpler proof of Theorem 23 from Daniely et al. [2011], which states that for
every tree $T$ there exists a class $\mathcal{H}$ of VC dimension $d$ for which $d_N(\mathcal{H}_T) \ge d(k-1)$.

A.5 ECOC, One vs. All and All Pairs 

To prove the results for ECOC and its special cases, we first prove a more general theorem, based on the
notion of a sensitive vector for a given code. Fix a code $M \in M_{k\times l}(\mathbb{R})$. We say that a binary vector
$u \in \{\pm 1\}^l$ is $q$-sensitive for $M$ if there are $q$ indices $j \in [l]$ for which $\tilde{M}(u) \ne \tilde{M}(u \oplus e_j)$. Here,
$u \oplus e_j := (u[1], \dots, -u[j], \dots, u[l])$.

Theorem A.4. If there exists a $q$-sensitive vector for a code $M \in M_{k\times l}(\mathbb{R})$ then $d_N(\mathcal{W}_M) \ge d \cdot q$.

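Proof. By Lemma A.1, it suffices to show that $d_N(\mathcal{F}^l_M) \ge d \cdot q$. Let $u \in \{\pm 1\}^l$ be a $q$-sensitive vector.
Assume w.l.o.g. that the sensitive coordinates are $1, \dots, q$. We shall show that $[d] \times [q]$ is N-shattered by
$\mathcal{F}^l_M$. Define $g_1, g_2 : [d] \times [q] \to [k]$ by

$$g_1(x,y) = \tilde{M}(u), \qquad g_2(x,y) = \tilde{M}(u \oplus e_y).$$

Let $T \subseteq [d] \times [q]$. Define $h_1, \dots, h_l \in \mathcal{F}^l$ as follows. For every $j > q$, define $h_j \equiv u[j]$. For $j \le q$
define

$$h_j(x,y) = \begin{cases} u[j] & y \ne j \\ u[j] & y = j,\ (x,y) \in T \\ -u[j] & y = j,\ (x,y) \in [d] \times [q] \setminus T. \end{cases}$$

For $h = (h_1, \dots, h_l)$, it is not hard to check that

$$\forall (x,y) \in T,\ \tilde{M}(h_1(x,y), \dots, h_l(x,y)) = g_1(x,y), \text{ and}$$
$$\forall (x,y) \in [d] \times [q] \setminus T,\ \tilde{M}(h_1(x,y), \dots, h_l(x,y)) = g_2(x,y).$$

□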

The following lemma shows that a code with a large distance is also highly sensitive. In fact, we prove a
stronger claim: the sensitivity is at least half the distance between some row and the row closest
to it in Hamming distance. Formally, we consider $\Delta(M) = \max_i \min_{j \ne i} \Delta_h(M[i], M[j]) \ge \delta(M)$.

Lemma A.5. For any code $M \in M_{k\times l}(\pm 1)$, there is a $q$-sensitive vector for $M$, where $q \ge \frac{1}{2}\Delta(M) \ge \frac{1}{2}\delta(M)$.



Proof. Let $i_1$ be the row in $M$ such that its Hamming distance to the row closest to it is $\Delta(M)$. Denote by $i_2$
the index of the closest row (if there is more than one such row, choose one of them arbitrarily). We have
$\Delta_h(M[i_1], M[i_2]) = \Delta(M)$. In addition, $\forall i \ne i_1, i_2$, $\Delta_h(M[i_1], M[i]) \ge \Delta(M)$. Assume w.l.o.g. that
the indices in which rows $i_1$ and $i_2$ differ are $1, \dots, \Delta(M)$. Consider first the case that $i_1 < i_2$. Define
$u \in \{\pm 1\}^l$ by

$$u[j] = \begin{cases} M(i_1, j) & j \le \lceil \Delta(M)/2 \rceil \\ M(i_2, j) & \text{otherwise.} \end{cases}$$

It is not hard to check that for every $1 \le j \le \lceil \Delta(M)/2 \rceil$, $i_1 = \tilde{M}(u)$ and $\tilde{M}(u \oplus e_j) = i_2$, thus $u$ is $\lceil \Delta(M)/2 \rceil$-sensitive.
If $i_1 > i_2$, the proof is similar except that $u$ is defined as

$$u[j] = \begin{cases} \ldots \\ M(i_2, j) & \text{otherwise.} \end{cases}$$

□



Proof of Theorem 3.3. The upper bound follows from Lemma 4.1. The lower bound follows from Theorem
A.4 and Lemma A.5. □

Proof of Theorem 3.4. The upper bounds follow from Theorem 3.3. To show that $d_N(\mathcal{W}_{\mathrm{OvA}}) \ge (k-1)d$,
we note that the all-negative vector $u = (-1, \dots, -1)$ of length $k$ is $(k-1)$-sensitive for the code $M^{\mathrm{OvA}}$,
and apply Theorem A.4.

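To show that $d_N(\mathcal{W}_{\mathrm{AP}}) \ge d\binom{k-1}{2}$, assume for simplicity that $k$ is odd (a similar analysis can be given
when $k$ is even). Define $u \in \{\pm 1\}^{\binom{k}{2}}$ by

$$\forall i < j,\quad u[i,j] = \begin{cases} 1 & j - i \le \frac{k-1}{2} \\ -1 & \text{otherwise.} \end{cases}$$

For every $n \in [k]$, we have $(M^{\mathrm{AP}} u)_n = \sum_{1 \le i < j \le k} M^{\mathrm{AP}}_{n,(i,j)} \cdot u[i,j] = 0$, since among the pairs with
$n \in \{i,j\}$, the number of pairs for which $M^{\mathrm{AP}}_{n,(i,j)}$ agrees with $u[i,j]$ equals the number of pairs for which it
disagrees. Thus, $\tilde{M}^{\mathrm{AP}}(u) = 1$, by our tie-breaking assumptions. Moreover, it follows that for every
$1 \le i < j \le k$, we have $\tilde{M}^{\mathrm{AP}}(u \oplus e_{(i,j)}) \in \{i, j\}$,
since flipping entry $[i,j]$ of $u$ increases $(M^{\mathrm{AP}} u)_j$ or $(M^{\mathrm{AP}} u)_i$ and does not increase the rest of the
coordinates of the vector $M^{\mathrm{AP}} u$. This shows that $u$ is $\binom{k-1}{2}$-sensitive. □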
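
The two sensitivity claims used in this proof can be checked numerically for a small $k$. The sketch below is only an illustration: the function names are hypothetical, the OvA code is taken to have $+1$ on the diagonal and $-1$ elsewhere, and the All Pairs code is assumed to have $+1$ in row $i$ and $-1$ in row $j$ of column $(i,j)$ (the opposite sign convention gives the same zero row sums for the vector $u$ above). Decoding takes the row with the largest inner product, with ties broken toward the smallest index.

```python
from itertools import combinations
from math import comb

import numpy as np


def decode(M, u):
    """Index of the row of M with the largest inner product with u.
    np.argmax returns the first maximizer, matching the tie-breaking rule."""
    return int(np.argmax(M @ u))


def sensitivity(M, u):
    """Number of coordinates whose sign flip changes the decoded label."""
    base = decode(M, u)
    count = 0
    for j in range(len(u)):
        flipped = u.copy()
        flipped[j] = -flipped[j]
        if decode(M, flipped) != base:
            count += 1
    return count


def ova_code(k):
    """k x k code: +1 on the diagonal, -1 elsewhere."""
    return 2 * np.eye(k, dtype=int) - 1


def ap_code(k):
    """k x C(k,2) code; column (i, j) has +1 in row i and -1 in row j (assumed)."""
    pairs = list(combinations(range(k), 2))
    M = np.zeros((k, len(pairs)), dtype=int)
    for col, (i, j) in enumerate(pairs):
        M[i, col], M[j, col] = 1, -1
    return M, pairs


def ap_vector(k, pairs):
    """u[i, j] = 1 if j - i <= (k - 1) / 2 and -1 otherwise, for odd k."""
    return np.array([1 if j - i <= (k - 1) // 2 else -1 for i, j in pairs])


if __name__ == "__main__":
    k = 5
    print(sensitivity(ova_code(k), -np.ones(k, dtype=int)))
    M_ap, pairs = ap_code(k)
    print(sensitivity(M_ap, ap_vector(k, pairs)), comb(k - 1, 2))
```

Counting flips directly gives the exact sensitivity of the chosen vector, which may exceed the lower bound $\binom{k-1}{2}$ stated in the proof.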



A.6 Approximation 

Proof of Theorem 3.5. We first show that for any tree $T$ for $k$ classes, $\mathcal{L}$ essentially contains $\mathcal{W}_T$. It follows
that $\mathcal{L}$ essentially contains $\mathcal{W}_{\mathrm{trees}}$ as well. Let $\mathcal{D}$ be a distribution over $\mathbb{R}^d$, let $C : N(T) \to \mathcal{W}$ be a mapping
associating nodes in $T$ to binary classifiers in $\mathcal{W}$, and let $\epsilon > 0$. We will show that there exists a matrix
$W \in \mathbb{R}^{k\times(d+1)}$ such that $\Pr_{x\sim\mathcal{D}}[h_W(x) \ne h_C(x)] < \epsilon$.

For every $v \in N(T)$, denote by $w(v) \in \mathbb{R}^{d+1}$ the linear separator such that $C(v) = h_{w(v)}$. For every
$w \in \mathbb{R}^{d+1}$ define $\tilde{w} = w + (0, \dots, 0, \gamma)$. Recall that for $x \in \mathbb{R}^d$, $\bar{x} \in \mathbb{R}^{d+1}$ is simply the concatenation
$(x, 1)$. Choose $r > 0$ large enough so that $\Pr_{x\sim\mathcal{D}}[\|\bar{x}\| > r] < \epsilon/2$ and $\forall v \in N(T)$, $\|\tilde{w}(v)\| < r$. Choose
$\gamma > 0$ small enough so that

$$\Pr[\exists v \in N(T),\ \langle \tilde{w}(v), \bar{x}\rangle \in (-\gamma, \gamma)] = \Pr[\exists v \in N(T),\ \langle w(v), \bar{x}\rangle \in (-2\gamma, 0)] < \epsilon/2.$$

Let $a = 2r^2/\gamma + 1$. For $i \in [k]$, let $v_{i,1}, \dots, v_{i,m_i}$ be the path from the root to the leaf associated with label
$i$. For each $1 \le j < m_i$ define $b_{i,j} = 1$ if $v_{i,j+1}$ is the right son of $v_{i,j}$, and $b_{i,j} = -1$ otherwise. Now,
define $W \in \mathbb{R}^{k\times(d+1)}$ to be the matrix whose $i$'th row is $w_i = \sum_{j=1}^{m_i-1} a^{-j} b_{i,j} \tilde{w}(v_{i,j})$.

To prove that $\Pr_{x\sim\mathcal{D}}[h_W(x) \ne h_C(x)] < \epsilon$, it suffices to show that $h_W(x) = h_C(x)$ for every $x \in \mathbb{R}^d$
satisfying $\|\bar{x}\| < r$ and $\forall v \in N(T)$, $\langle \tilde{w}(v), \bar{x}\rangle \notin (-\gamma, \gamma)$, since the probability mass of the rest of the
vectors is less than $\epsilon$. Let $x \in \mathbb{R}^d$ be a vector that satisfies these assumptions. Denote $i_1 = h_C(x)$. It
suffices to show that for all $i_2 \in [k] \setminus \{i_1\}$, $\langle w_{i_1}, \bar{x}\rangle > \langle w_{i_2}, \bar{x}\rangle$, since this would imply that $h_W(x) = i_1$
as well.

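Indeed, fix $i_2 \ne i_1$, and let $j_0$ be the length of the joint prefix of the two root-to-leaf paths that match
the labels $i_1$ and $i_2$. In other words, $\forall j \le j_0$, $v_{i_1,j} = v_{i_2,j}$ and $v_{i_1,j_0+1} \ne v_{i_2,j_0+1}$. Note that

$$\langle \bar{x}, (b_{i_1,j_0} - b_{i_2,j_0})\tilde{w}(v_{i_1,j_0})\rangle = \langle \bar{x}, 2 b_{i_1,j_0}\tilde{w}(v_{i_1,j_0})\rangle = 2|\langle \bar{x}, \tilde{w}(v_{i_1,j_0})\rangle| \ge 2\gamma.$$

The last equality holds because $b_{i_1,j_0}$ and $\langle \bar{x}, \tilde{w}(v_{i_1,j_0})\rangle$ have the same sign by definition of $b_{i,j}$. We have

$$\begin{aligned}
\langle w_{i_1}, \bar{x}\rangle - \langle w_{i_2}, \bar{x}\rangle
&= \Big\langle \bar{x},\ \sum_{j=1}^{m_{i_1}-1} a^{-j} b_{i_1,j}\tilde{w}(v_{i_1,j}) - \sum_{j=1}^{m_{i_2}-1} a^{-j} b_{i_2,j}\tilde{w}(v_{i_2,j}) \Big\rangle \\
&= \langle \bar{x},\ a^{-j_0}(b_{i_1,j_0} - b_{i_2,j_0})\tilde{w}(v_{i_1,j_0})\rangle
 + \Big\langle \bar{x},\ \sum_{j=j_0+1}^{m_{i_1}-1} a^{-j} b_{i_1,j}\tilde{w}(v_{i_1,j}) - \sum_{j=j_0+1}^{m_{i_2}-1} a^{-j} b_{i_2,j}\tilde{w}(v_{i_2,j}) \Big\rangle \\
&\ge \langle \bar{x},\ a^{-j_0}(b_{i_1,j_0} - b_{i_2,j_0})\tilde{w}(v_{i_1,j_0})\rangle - 2\sum_{j=j_0+1}^{\infty} a^{-j} r^2 \\
&\ge 2a^{-j_0}\Big(\gamma - \frac{r^2}{a-1}\Big) > 0.
\end{aligned}$$

Since this holds for all $i_2 \ne i_1$, it follows that $h_W(x) = i_1$. Thus, we have proved that $\mathcal{L}$ essentially
contains $\mathcal{W}_{\mathrm{trees}}$.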
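
The construction of $W$ from the node separators can be tried out on a concrete three-class tree. The sketch below is a hedged illustration rather than a general implementation: it hard-codes a hypothetical tree whose root sends points left to label 0 and right to an inner node that separates labels 1 (left) and 2 (right), and the function names and the example values of $\gamma$ and $r$ are assumptions. The rows follow $w_i = \sum_j a^{-j} b_{i,j}\tilde{w}(v_{i,j})$ with $a = 2r^2/\gamma + 1$.

```python
import numpy as np


def tree_predict(x_bar, w_root, w_inner):
    """Tree classifier on x_bar = (x, 1), with sign(0) = 1 meaning 'go right'."""
    if w_root @ x_bar < 0:
        return 0
    return 2 if w_inner @ x_bar >= 0 else 1


def tree_to_msvm(w_root, w_inner, gamma, r):
    """Build the 3 x (d+1) matrix W from the shifted separators w + (0,...,0,gamma)."""
    a = 2 * r ** 2 / gamma + 1
    shift = np.zeros_like(w_root)
    shift[-1] = gamma
    t_root, t_inner = w_root + shift, w_inner + shift
    W = np.stack([
        -t_root / a,                      # label 0: left at the root
        t_root / a - t_inner / a ** 2,    # label 1: right, then left
        t_root / a + t_inner / a ** 2,    # label 2: right, then right
    ])
    return W, t_root, t_inner


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_root = np.array([1.0, 0.0, 0.2])
    w_inner = np.array([0.0, 1.0, -0.1])
    gamma, r = 0.01, 3.0
    W, t_root, t_inner = tree_to_msvm(w_root, w_inner, gamma, r)
    agree = total = 0
    for _ in range(1000):
        x_bar = np.append(rng.uniform(-1, 1, size=2), 1.0)
        if min(abs(t_root @ x_bar), abs(t_inner @ x_bar)) < gamma:
            continue  # skip points inside the excluded margin band
        total += 1
        agree += int(np.argmax(W @ x_bar)) == tree_predict(x_bar, w_root, w_inner)
    print(agree, total)
```

Points inside the margin band are skipped because the proof only guarantees agreement outside it; shrinking $\gamma$ shrinks the band at the cost of a larger $a$ and therefore smaller gaps between the row scores.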

Next, we show that $\mathcal{L}$ strictly contains $\mathcal{W}_{\mathrm{trees}}$, by showing a distribution over labeled examples such
that the approximation error using $\mathcal{L}$ is strictly smaller than the approximation error using $\mathcal{W}_{\mathrm{trees}}$. Assume
w.l.o.g. that $d = 2$ and $k = 3$: even if they are larger we can always restrict the support of the distribution
to a subspace of dimension 2 and to only three of the labels. Consider the distribution $\mathcal{D}$ over $\mathbb{R}^2 \times [3]$ such
that its marginal over $\mathbb{R}^2$ is uniform in the unit circle, and $\Pr_{(X,Y)\sim\mathcal{D}}[Y = i \mid X = x] = \mathbb{1}[x \in D_i]$, where
$D_1, D_2, D_3$ are sectors of equal angle of the unit circle (see Figure 1):



Figure 1: Illustration for the proof of Theorem 3.5 

Clearly, by taking the rows of $W$ to point to the middle of each sector (dashed arrows in the illustration),
we get $\mathrm{Err}^*_{\mathcal{D}}(\mathcal{L}) = 0$. In contrast, no linear separator can split the three labels into two groups without error,
thus $\mathrm{Err}^*_{\mathcal{D}}(\mathcal{W}_{\mathrm{trees}}) > 0$.
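
The first half of this claim, that rows pointing to the middle of each sector classify the three-sector distribution without error, is easy to check by simulation. The sketch below is a minimal illustration under assumptions not fixed by the text: the sectors are taken to be the angular ranges $[0, 2\pi/3)$, $[2\pi/3, 4\pi/3)$ and $[4\pi/3, 2\pi)$, labels are 0-indexed, points are sampled uniformly from the unit disc, and all function names are hypothetical.

```python
import numpy as np


def sample_sector_data(n, rng):
    """Sample n points uniformly from the unit disc, labeled by angular sector."""
    theta = rng.uniform(0.0, 2 * np.pi, size=n)
    radius = np.sqrt(rng.uniform(0.0, 1.0, size=n))
    X = np.column_stack([radius * np.cos(theta), radius * np.sin(theta)])
    y = np.floor(theta / (2 * np.pi / 3)).astype(int)
    return X, y


def sector_msvm():
    """3 x 3 matrix whose rows point to the middle of each sector, with zero bias."""
    mids = (np.arange(3) + 0.5) * 2 * np.pi / 3
    return np.column_stack([np.cos(mids), np.sin(mids), np.zeros(3)])


def msvm_predict(W, X):
    """Row-wise argmax of W applied to (x, 1) for every row x of X."""
    X_bar = np.column_stack([X, np.ones(len(X))])
    return np.argmax(X_bar @ W.T, axis=1)


if __name__ == "__main__":
    X, y = sample_sector_data(5000, np.random.default_rng(0))
    print(np.mean(msvm_predict(sector_msvm(), X) == y))
```

The score of row $i$ on a point at angle $\theta$ is proportional to $\cos(\theta - \text{mid}_i)$, so the argmax picks the sector whose middle direction is angularly closest, which is the sector containing the point.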

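Finally, to see that $\mathcal{L}$ essentially contains $\mathcal{W}_{\mathrm{OvA}}$, we note that $\mathcal{W}_{\mathrm{OvA}} = \mathcal{W}_T$ where $T$ is a tree such
that each of its internal nodes has a leaf corresponding to one of the labels as its left son. Thus $\mathcal{W}_{\mathrm{OvA}}$ is
essentially contained in $\mathcal{W}_{\mathrm{trees}}$. □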

Proof of Theorem 3.7. It is easily seen that $\mathcal{W}_{\mathrm{AP}}$ contains $\mathcal{L}$: Let $W \in \mathbb{R}^{k\times(d+1)}$, and denote its $i$'th row by
$W[i]$. For each column $(i,j)$ of $M^{\mathrm{AP}}$, define the binary classifier $h_{i,j} \in \mathcal{W}$ such that $\forall x \in \mathbb{R}^d$, $h_{i,j}(x) =
\mathrm{sign}(\langle W[j] - W[i], \bar{x}\rangle)$. Then for all $x$, $h_W(x) = \tilde{M}^{\mathrm{AP}}(h_{1,2}(x), \dots, h_{k-1,k}(x))$.

To show that the inclusion is strict, as in the proof of Theorem 3.5, we can and will assume that $d = 2$.
Choose $k^*$ to be the minimal number such that for every $k \ge k^*$, $d_N(\mathcal{W}_{\mathrm{AP}}) > d_N(\mathcal{L})$: This number exists
by Theorems 3.4 and 3.1 (note that though we chose $k^*$ w.r.t. $d = 2$, the same $k^*$ is valid for every $d$). For
any $k \ge k^*$, it follows that there is a set $S \subseteq \mathbb{R}^2$ that is N-shattered by $\mathcal{W}_{\mathrm{AP}}$ but not by $\mathcal{L}$. Thus, there is
a hypothesis $h \in \mathcal{W}_{\mathrm{AP}}$ such that for every $g \in \mathcal{L}$, $g|_S \ne h|_S$. Define the distribution $\mathcal{D}$ to be uniform over
$\{(x, h(x)) : x \in S\}$. Then clearly $\mathrm{Err}^*_{\mathcal{D}}(\mathcal{L}) > \mathrm{Err}^*_{\mathcal{D}}(\mathcal{W}_{\mathrm{AP}}) = 0$. □

Next, we prove Theorem 3.6, which we restate more formally as follows. Note that the result on OvA
is implied since there exists a tree that implements OvA.

Theorem A.6. (Restatement of Theorem 3.6) If there exists an embedding $\iota : \mathbb{R}^d \to \mathbb{R}^{d'}$ and a tree $T$ such
that $\mathcal{W}^{d'}_T \circ \iota$ essentially contains $\mathcal{L}$, then necessarily $d' \ge \tilde{\Omega}(dk)$.

Proof. Assume that $i \in [k]$ is the class corresponding to the leaf with the least depth, $l$. Note that $l \le
\log_2(k)$. Let $\phi : [k] \to \{\pm 1\}$ be the function that is $1$ on $\{i\}$ and $-1$ otherwise. It is not hard to see that
$\phi \circ \mathcal{L}$ is the hypothesis class of convex polyhedra in $\mathbb{R}^d$ having $k - 1$ faces. Thus,

$$\mathrm{VC}(\phi \circ \mathcal{L}) \ge (k-1)d, \qquad (3)$$

[see e.g. Takacs, 2009]. On the other hand, $\phi \circ \mathcal{W}^{d'}_T$ is the class of convex polyhedra in $\mathbb{R}^{d'}$ having
$l \le \log_2(k)$ faces. Thus, by Lemma 4.1

$$\mathrm{VC}(\phi \circ \mathcal{W}^{d'}_T \circ \iota) \le \mathrm{VC}(\phi \circ \mathcal{W}^{d'}_T) \le O(l d' \log(l d')) \le O(\log(k)\, d' \log(\log(k)\, d')). \qquad (4)$$

By the assumption that $\mathcal{W}^{d'}_T \circ \iota$ essentially contains $\mathcal{L}$, $\mathrm{VC}(\phi \circ \mathcal{L}) \le \mathrm{VC}(\phi \circ \mathcal{W}^{d'}_T \circ \iota)$. Combining with
equations (3) and (4) it follows that $d(k-1) = O(\log(k)\, d' \log(\log(k)\, d'))$. Thus $d' = \tilde{\Omega}(dk)$. □

To prove Lemma 3.8, we first state the classic VC-dimension theorem, which will be useful to us. 

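Theorem A.7 (Vapnik [1998]). There exists a constant $C > 0$ such that for every hypothesis class $\mathcal{H} \subseteq
\{\pm 1\}^{\mathcal{X}}$ of VC dimension $d$, a distribution $\mathcal{D}$ over $\mathcal{X}$, $\epsilon, \delta > 0$ and $m \ge C\,\frac{d + \ln(1/\delta)}{\epsilon^2}$ we have

$$\Pr_{S \sim \mathcal{D}^m}\Big[\mathrm{Err}^*_{\mathcal{D}}(\mathcal{H}) \ge \inf_{h\in\mathcal{H}} \mathrm{Err}_S(h) - \epsilon\Big] \ge 1 - \delta.$$

We also use the following lemma, which proves a variant of Hoeffding's inequality.

Lemma A.8. Let $\beta_1, \dots, \beta_k \ge 0$ and let $\gamma_1, \dots, \gamma_k \in \mathbb{R}$, such that $\forall i$, $|\gamma_i| \le \beta_i$. Fix an integer $j \in
\{1, \dots, \lfloor k/2 \rfloor\}$ and let $\mu = j/k$. Let $(X_1, \dots, X_k) \in \{\pm 1\}^k$ be a random vector sampled uniformly from
the set $\{(x_1, \dots, x_k) : \sum_{i=1}^k \frac{x_i+1}{2} = \mu k\}$. Define $Y_i = \beta_i + X_i\gamma_i$ and denote $\alpha_i = \beta_i + |\gamma_i|$. Assume that
[…]. Then

$$\Pr\Big[\sum_{i=1}^k Y_i < \mu - \epsilon\Big] \le \ldots$$

Proof. First, since $\mu \le \frac{1}{2}$, it suffices to prove the claim for the case $\forall i$, $\gamma_i \ge 0$ since this is the "harder"
case. Let $Z_1, \dots, Z_k \in \{\pm 1\}$ be independent random variables such that $\Pr[Z_i = 1] = \mu - \frac{\epsilon}{2}$. Denote
$W_i = \beta_i + Z_i\gamma_i$. Further denote $W = \sum_{i=1}^k W_i$ and $Z = \sum_{i=1}^k \frac{Z_i+1}{2}$.

Note that for every $j_0 \le j = \mu k$, given that $Z = j_0$, $W$ can be described as follows: We start with the
value $\sum_{i=1}^k (\beta_i - \gamma_i)$ and then choose $j_0$ indices uniformly from $[k]$. For each chosen index $i$, the value of $W$
is increased by $2\gamma_i$. $\sum_{i=1}^k Y_i$ can be described in the same way, except that $j \ge j_0$ indices are chosen.
Thus, $\Pr\big[\sum_{i=1}^k Y_i < \mu - \epsilon\big] \le \Pr[W < \mu - \epsilon \mid Z = j_0]$. Thus, we have

$$\begin{aligned}
\Pr\Big[\sum_{i=1}^k Y_i < \mu - \epsilon\Big] &\le \Pr[W < \mu - \epsilon \mid Z \le \mu k] \\
&\le \Pr[W < \mu - \epsilon] \,/\, \Pr[Z \le \mu k] \\
&\le 2\Pr[W < \mu - \epsilon] \\
&\le \ldots
\end{aligned}$$

The last inequality follows from Hoeffding's inequality and noting that

$$\mathbb{E}[W_i] = \beta_i + \Big(2\Big(\mu - \frac{\epsilon}{2}\Big) - 1\Big)\gamma_i = \Big(\mu - \frac{\epsilon}{2}\Big)(\beta_i + \gamma_i) + \Big(1 - \mu + \frac{\epsilon}{2}\Big)(\beta_i - \gamma_i) \ge \Big(\mu - \frac{\epsilon}{2}\Big)\alpha_i.$$

So that $\sum_{i=1}^k \mathbb{E}[W_i] \ge \big(\mu - \frac{\epsilon}{2}\big)\sum_{i=1}^k \alpha_i$.

□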



Proof of Lemma 3.8. The idea of this proof is as follows: Using a uniform convergence argument based on
the VC dimension of the binary hypothesis class, we show that there exists a labeled sample $S$ such that
$|S| \approx \frac{d+k}{\nu^2}$ and for all possible mappings $\phi$, the approximation error of the hypothesis class on the sample
is close to the approximation error on the distribution $\mathcal{D}_\phi$. This allows us to restrict our attention to a finite
set of hypotheses, based on their restriction to the sample. For these hypotheses, we show that with high
probability over the choice of $\phi$, the approximation error on the sample is high. Using a union bound on the
possible hypotheses, we conclude that the approximation error on the distribution will be high, with high
probability over the choice of $\phi$.

For $i \in [k]$, denote $p_i = \Pr_{x\sim\mathcal{D}}[f(x) = i]$. Let $S = \{(x_1,y_1), \dots, (x_m,y_m)\} \subseteq \mathcal{X} \times [k]$ be an i.i.d.
sample drawn according to $\mathcal{D}$ where $m = \Big\lceil C\,\frac{d + (k+2)\ln(2)}{(\nu/2)^2} \Big\rceil$, for the constant $C$ from Theorem A.7.



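Given $S$, denote $S_\phi = \{(x_1, \phi(y_1)), \dots, (x_m, \phi(y_m))\} \subseteq \mathcal{X} \times \{\pm 1\}$. For $i \in [k]$, let $\hat{p}_i = |\{j : y_j = i\}|/m$.

For any fixed $\phi : [k] \to \{\pm 1\}$, with probability $\ge 1 - 2^{-(k+2)}$ over the choice of $S$ we have, by
Theorem A.7, that $\mathrm{Err}^*_{\mathcal{D}_\phi}(\mathcal{H}) \ge \inf_{h\in\mathcal{H}} \mathrm{Err}_{S_\phi}(h) - \frac{\nu}{2}$. Since $|\{\pm 1\}^{[k]}| = 2^k$, w.p. $\ge 1 - \frac{1}{4}$,

$$\forall \phi \in \{\pm 1\}^{[k]},\quad \mathrm{Err}^*_{\mathcal{D}_\phi}(\mathcal{H}) \ge \inf_{h\in\mathcal{H}} \mathrm{Err}_{S_\phi}(h) - \frac{\nu}{2}. \qquad (5)$$

Moreover, we have

$$\ldots$$

Thus, by Markov's inequality, w.p. $\ge \frac{1}{2}$ we have

$$\ldots \qquad (6)$$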



Thus, with probability at least $1 - \frac{1}{2} - \frac{1}{4} > 0$, both (6) and (5) hold. In particular, there exists a sample $S$
for which both (6) and (5) hold. Let us fix such an $S = \{(x_1,y_1), \dots, (x_m, y_m)\}$.
Assume now that $\phi \in \{\pm 1\}^{[k]}$ is sampled according to the first condition. Denote

$$Y_i = |\{j : h(x_j) \ne \phi(y_j) \text{ and } y_j = i\}|/m.$$
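For a fixed $h \in \mathcal{H}$ we have

$$\Pr\Big[\mathrm{Err}_{S_\phi}(h) < \mu - \frac{\nu}{2}\Big] = \Pr\Big[\sum_{i=1}^k Y_i < \mu - \frac{\nu}{2}\Big].$$

We note that $Y_i$ are independent random variables with $\mathbb{E}[Y_i] \ge \ldots$ and $0 \le Y_i \le \hat{p}_i$. Thus, by Hoeffding's
inequality,

$$\Pr\Big[\mathrm{Err}_{S_\phi}(h) < \mu - \frac{\nu}{2}\Big] \le \exp(\ldots) \le \exp(\ldots).$$

By Sauer's lemma, $|\mathcal{H}|_{\{x_1, \dots, x_m\}}| \le \big(\frac{em}{d}\big)^d$. Thus, with probability $\ge 1 - \big(\frac{em}{d}\big)^d \exp(\ldots)$ over the
choice of $\phi$, $\inf_{h\in\mathcal{H}} \mathrm{Err}_{S_\phi}(h) \ge \mu - \frac{\nu}{2}$ and by (5) also

$$\mathrm{Err}^*_{\mathcal{D}_\phi}(\mathcal{H}) \ge \mu - \nu. \qquad (7)$$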

Finally, since $m = O\big(\frac{d+k}{\nu^2}\big)$, if $k = \Omega(\ldots)$ then Eq. (7) holds w.p. $\ge 1 - \delta$, concluding the
proof for the case when the first condition holds. If the second condition holds, the proof is very similar,
with the sole difference that Lemma A.8 is used instead of Hoeffding's inequality. □

Proof of Corollary 3.9. The Corollary follows from Lemma 3.8, by noting that $\mathrm{Err}^*_{\mathcal{D}}(\mathcal{H}_T) \ge \mathrm{Err}^*_{\mathcal{D}_\phi}(\mathcal{H})$,
where $\phi : [k] \to \{\pm 1\}$ is defined as $\phi(i) = 1$ if and only if $\lambda^{-1}(i)$ is in the right subtree emanating from
the root of $T$. □

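Proof of Corollary 3.10. Let $\phi : [k] \to \{\pm 1\}$ be the function that is $-1$ on $[\lfloor k/2 \rfloor]$ and $1$ otherwise. By
Lemma 4.1, applied to $L(\mathcal{H}) = \phi \circ \mathcal{H}_{(M,\mathrm{id})}$, $\mathrm{VC}(\phi \circ \mathcal{H}_{(M,\mathrm{id})}) = O(dl \log(dl))$, so that, by Lemma 3.8
(applied to a random choice of $\lambda$ instead of $\phi$), $\mathrm{Err}^*_{\mathcal{D}_{\phi\circ\lambda}}(\phi \circ \mathcal{H}_{(M,\mathrm{id})}) \ge \frac{1}{2} - \nu$ with probability $\ge 1 - \delta$ over
the choice of $\lambda$. The proof follows as we note that for every $\lambda$, $\mathrm{Err}^*_{\mathcal{D}}(\mathcal{H}_{(M,\lambda^{-1})}) = \mathrm{Err}^*_{\mathcal{D}_\lambda}(\mathcal{H}_{(M,\mathrm{id})}) \ge
\mathrm{Err}^*_{\mathcal{D}_{\phi\circ\lambda}}(\phi \circ \mathcal{H}_{(M,\mathrm{id})})$.

□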